A Low-Resourced Peruvian Language Identification Model

Authors

  • Alexandra Espichán-Linares
  • Arturo Oncevay-Marcos
Abstract

Owing to the linguistic revitalization in Perú over recent years, there is growing interest in reinforcing bilingual education in the country and in increasing research focused on its native languages. From a computer science perspective, one of the first steps in supporting the study of these languages is the implementation of an automatic language identification tool using machine learning methods. This work therefore focuses on two steps: (1) building a digital, annotated corpus for 16 Peruvian native languages extracted from documents in web repositories, and (2) fitting a supervised learning model for the language identification task using features identified in related state-of-the-art studies, such as n-grams. The results obtained were promising (97% average precision), and the corpus and the model are expected to be used for more complex tasks in the future.
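
To make step (2) concrete, the following is a minimal sketch of language identification from character n-gram features. It is an illustration only, not the authors' model: the abstract does not name the classifier, so a multinomial Naive Bayes with add-one smoothing is assumed here, and the names `char_ngrams` and `NgramNaiveBayes`, the n-gram range of 1 to 3, and the English/Spanish toy sentences are all hypothetical stand-ins for the 16-language corpus.

```python
import math
from collections import Counter, defaultdict


def char_ngrams(text, n_min=1, n_max=3):
    """Return character n-grams of length n_min..n_max from lowercased, space-padded text."""
    padded = f" {text.lower().strip()} "
    return [
        padded[i:i + n]
        for n in range(n_min, n_max + 1)
        for i in range(len(padded) - n + 1)
    ]


class NgramNaiveBayes:
    """Multinomial Naive Bayes over character n-gram counts (assumed classifier)."""

    def fit(self, texts, labels):
        self.counts = defaultdict(Counter)   # label -> n-gram counts
        self.doc_counts = Counter(labels)    # label -> number of training texts
        for text, label in zip(texts, labels):
            self.counts[label].update(char_ngrams(text))
        self.vocab = {g for c in self.counts.values() for g in c}
        self.totals = {lab: sum(c.values()) for lab, c in self.counts.items()}
        return self

    def predict(self, text):
        n_docs = sum(self.doc_counts.values())
        vocab_size = len(self.vocab)
        best_label, best_score = None, -math.inf
        for label in self.counts:
            # Log prior plus add-one smoothed log likelihood of each n-gram.
            score = math.log(self.doc_counts[label] / n_docs)
            for gram in char_ngrams(text):
                score += math.log(
                    (self.counts[label][gram] + 1) / (self.totals[label] + vocab_size)
                )
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy stand-in data (hypothetical; not drawn from the paper's corpus).
train_texts = [
    "the house is near the river",
    "this is a small book about language",
    "la casa está cerca del río",
    "este es un libro pequeño sobre la lengua",
]
train_labels = ["eng", "eng", "spa", "spa"]

model = NgramNaiveBayes().fit(train_texts, train_labels)
print(model.predict("the book is here"))
print(model.predict("el libro está aquí"))
```

Character n-grams are a common choice for this task because they need no tokenizer or lexicon, which matters for languages with few digital resources. A real evaluation would use held-out documents per language and report per-class precision rather than a handful of toy sentences.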


Similar resources

EFL Teachers’ Corrective Feedback and Students’ Revision in a Peruvian University: A descriptive study

This study explored the EFL teachers’ written corrective feedback (CF) techniques and their EFL students’ ability to integrate the CF while revising their texts. A total of 72 EFL students and 4 EFL teachers participated in this study. The data were collected through explicitation interviews administered to teachers and students, as well as through students’ written productions. A content analy...


Modeling code-Switching speech on under-resourced languages for language identification

This paper presents an integration of phonotactic information to perform language identification (LID) in a mixed-language speech. A single-pass front-end recognition system is employed to convert the spoken utterances into a statistical occurrence of phone sequences. To process such phone sequences, a hidden Markov model (HMM) is utilized to build robust acoustic models that can handle multipl...


Language identification of code Switching sentences and multilingual sentences of under-resourced languages by using multi structural word information

Language identification (LID) is the process of identifying the languages used in a text or in speech. Code switching is switching between languages within a sentence or speech utterance. This paper focuses on LID of words in code-switching sentences. Code switching can occur intersententially or intrasententially. The reasons why a writer switches from one language to another due to various reasons and among ...


Language Identification for Under-Resourced Languages in the Basque Context

Automatic Speech Recognition (ASR) is a broad research area that absorbs considerable effort from the research community. Interest in multilingual systems arises in the Basque Country because there are three official languages (Basque, Spanish, and French), and there is much linguistic interaction among them, even though Basque has very different roots from the other two languages. The development of...


Publication date: 2017